NBA Player 2014-15 Stats

Let's explore some NBA player stats:

Comparing two stat categories in Q->Q (quantitative vs. quantitative) scatter plots with best-fit linear regression lines and correlation coefficients (also adding a third, categorical (C) dimension by labeling player positions)
Some C->Q (categorical vs. quantitative) comparisons of stats across guard / forward / center positions



In [1]:

    
import numpy as np
import csv

class Rows:
    """Helper class for dealing with small data sets read in by csv
    """
    def __init__(self, fname):
        with open(fname, 'r') as f:
            reader = csv.reader(f)
            self.colnames = next(reader)
            self.rows = list(reader)
            
    def col(self, *indices, conv=True):
        def col_val(c):
            if len(c) == 0:
                return None
            return float(c) if conv else c
        def row_val(row):
            v = [col_val(row[index]) for index in indices]
            return v
        return [row_val(row) for row in self.rows]

    def col_by_name(self, *colnames, conv=True):
        return self.col(*(self.colnames.index(colname) for colname in colnames), conv=conv)
    
    def clean_rows(self, *colnames, conv=True):
        """
        Returns rows filtering out any row where any column is missing.
        """
        row_vals = self.col_by_name(*colnames, conv=conv)
        clean_row_vals = (row_val for row_val in row_vals if all(c is not None for c in row_val))
        return clean_row_vals

player_stats = Rows('2014-15-player-per-game-averages.csv')

player_stats.colnames









    Out[1]:





['person_id',
 'last_name',
 'first_name',
 'position',
 'height_inches',
 'weight_lbs',
 'min',
 'pts',
 'fg_pct',
 'reb',
 'ast',
 'blk',
 'stl']



In [2]:

    
player_stats.col_by_name('pts')[:10]









    Out[2]:





[[5.9], [7.7], [13.3], [6.5], [2.3], [5.5], [23.4], [5.0], [8.6], [5.6]]



In [3]:

    
player_stats.col_by_name('height_inches')[:10]









    Out[3]:





[[79.0],
 [84.0],
 [77.0],
 [86.0],
 [82.0],
 [83.0],
 [83.0],
 [81.0],
 [76.0],
 [81.0]]

It would be fun to scatter plot different dimensions against each other, e.g. to see whether there's a clear relationship between height and rebounds per game. Let's build a helper function to plot.



In [4]:

    
import matplotlib.pyplot as plt
import scipy.stats as stats

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

def plot(rows, x_col_name, y_col_name):
    x_vals, y_vals = zip(*rows.clean_rows(x_col_name, y_col_name))
    plt.scatter(x_vals, y_vals)
    plt.xlabel(x_col_name)
    plt.ylabel(y_col_name)
    slope, intercept, r_value, p_value, std_err = stats.linregress(x_vals, y_vals)
    y_predicted = [intercept + slope*x for x in x_vals]
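    # '-' sets only the line style; the original 'k-' also asked for black, which conflicts with color='red'
    plt.plot(x_vals, y_predicted, '-', color='red')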
    plt.show()
    return r_value

r_value = plot(player_stats, 'height_inches', 'reb')

r_value









    












    Out[4]:





0.55233573948407055

So rebounding is moderately correlated with height (r ≈ 0.55). What about weight?
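To make the number returned by `plot` less of a black box, here is a minimal sketch of how a Pearson correlation coefficient and a least-squares line can be computed by hand. The function names `pearson_r` and `least_squares_line` and the toy values are my own illustrations, not part of the notebook or the data set; `stats.linregress` should give the same slope, intercept and r for the same inputs.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def least_squares_line(xs, ys):
    """Slope and intercept of the ordinary least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up toy values for illustration only, NOT taken from the NBA data set.
toy_heights = [74, 76, 78, 80, 82, 84]
toy_rebounds = [2.0, 3.5, 3.0, 5.5, 5.0, 8.0]

toy_r = pearson_r(toy_heights, toy_rebounds)
toy_slope, toy_intercept = least_squares_line(toy_heights, toy_rebounds)
print(f"r = {toy_r:.2f}, slope = {toy_slope:.3f}, intercept = {toy_intercept:.2f}")
```

The sign of r always matches the sign of the slope, and r is unchanged if either variable is rescaled (inches to centimeters, say), which is why it is handy for comparing pairings with different units.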



In [5]:

    
r_value = plot(player_stats, 'weight_lbs', 'reb')

r_value









    












    Out[5]:





0.59750128866536678

A little bit more predictive (r ≈ 0.60 vs. 0.55). Let's look at some more relationships for fun.
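One way to put "a little bit more predictive" into numbers: for a simple one-variable linear fit, r squared is the share of the variance in y that the line accounts for. The sketch below just squares the two r values copied from Out[4] and Out[5]; the helper name `r_squared` is mine.

```python
# r values copied from the Out[4] and Out[5] cells above.
r_height_reb = 0.55233573948407055
r_weight_reb = 0.59750128866536678

def r_squared(r):
    """Share of variance in y accounted for by a simple linear fit on x."""
    return r ** 2

for label, r in [("height", r_height_reb), ("weight", r_weight_reb)]:
    print(f"{label}: r = {r:.3f}, r^2 = {r_squared(r):.3f}")
```

So height alone accounts for roughly 31% of the variance in rebounds per game and weight for roughly 36%, which leaves most of the variation unexplained by either one.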



In [6]:

    
r_value = plot(player_stats, 'height_inches', 'weight_lbs')
r_value









    












    Out[6]:





0.82325617366095505



In [7]:

    
plot(player_stats, 'height_inches', 'blk')









    












    Out[7]:





0.54387765386491582



In [8]:

    
plot(player_stats, 'ast', 'stl')









    












    Out[8]:





0.65664317495189883



In [9]:

    
plot(player_stats, 'min', 'pts')









    












    Out[9]:





0.86582085304162326

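Weight is very strongly correlated with height (r ≈ 0.82): makes sense as most NBA dudes are ripped and lean. Minutes and points per game are tied even more tightly (r ≈ 0.87), with assists vs. steals (r ≈ 0.66) and height vs. blocks (r ≈ 0.54) further back.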

Let's look at some pairings I wouldn't expect to have much correlation.



In [10]:

    
plot(player_stats, 'pts', 'ast')









    












    Out[10]:





0.58579703010440354



In [11]:

    
plot(player_stats, 'stl', 'blk')









    












    Out[11]:





0.031807556410803559

Finally, let's take a look at some of these where we label the points by position; I bet we'll see a tighter fit within particular positions.



In [12]:

    
set([r[0] for r in player_stats.clean_rows('position', conv=False)])









    Out[12]:





{'Center', 'Forward', 'Guard'}



In [13]:

    
import collections

def plot_with_pos(rows, x_col_name, y_col_name):
    by_pos = collections.defaultdict(list)
    for pos, x, y in rows.clean_rows('position', x_col_name, y_col_name, conv=False):
        by_pos[pos].append((float(x), float(y)))

    r_values = []
    for color, (pos, values) in zip(('red', 'green', 'blue'), by_pos.items()):
        x_vals, y_vals = zip(*values)
        plt.scatter(x_vals, y_vals, color=color, label=pos)
        slope, intercept, r_value, p_value, std_err = stats.linregress(x_vals, y_vals)
        y_predicted = [intercept + slope*x for x in x_vals]
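        # '-' sets only the line style; the original 'k-' also asked for black, which conflicts with color=color
        plt.plot(x_vals, y_predicted, '-', color=color)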
        r_values.append((pos, r_value))
        
    plt.xlabel(x_col_name)
    plt.ylabel(y_col_name)
    plt.legend(loc='upper left')
    plt.show()
    return r_values

plot_with_pos(player_stats, 'height_inches', 'reb')









    












    Out[13]:





[('Center', 0.13303221752610414),
 ('Guard', 0.17422612413174524),
 ('Forward', 0.2399524539801445)]

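Interesting: once you break it down by position, the correlation between height and rebounding largely disappears (r drops from about 0.55 overall to between 0.13 and 0.24 within each position); being a tall guard apparently isn't really going to help you on the rebounding front. Or perhaps more likely, position accounts for most of the pooled correlation: once you are tall, you are unlikely to be a guard.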
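Here is a small sketch of how a pooled correlation can be strong while the within-group correlations vanish. The numbers are deliberately contrived toy values, not NBA data: within each made-up group, height tells you nothing about rebounds, but the groups sit at different heights and different rebound levels. The helper `pearson_r` is my own and is defined by hand so the block runs without scipy.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Contrived toy (height, rebounds) pairs, NOT taken from the NBA data set.
toy_groups = {
    "Guard": [(73, 3), (74, 2), (75, 2), (76, 3)],
    "Center": [(82, 9), (83, 8), (84, 8), (85, 9)],
}

# Correlation within each group on its own.
within_r = {}
for pos, pairs in toy_groups.items():
    heights, rebounds = zip(*pairs)
    within_r[pos] = pearson_r(heights, rebounds)

# Correlation after pooling every group into one cloud of points.
pooled_pairs = [pair for pairs in toy_groups.values() for pair in pairs]
pooled_heights, pooled_rebounds = zip(*pooled_pairs)
pooled_r = pearson_r(pooled_heights, pooled_rebounds)

print("within:", within_r)
print("pooled:", round(pooled_r, 3))
```

In this toy case the pooled r is driven entirely by the gap between the two groups. The real data is less extreme, since the within-position r values stay a bit above zero, but the same mechanism would explain most of the drop.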

Let's check out a few more.



In [14]:

    
plot_with_pos(player_stats, 'weight_lbs', 'reb')









    












    Out[14]:





[('Center', 0.169264457788738),
 ('Guard', 0.28214386602193664),
 ('Forward', 0.33482761191805177)]



In [15]:

    
# reminding myself of the stats available
player_stats.colnames









    Out[15]:





['person_id',
 'last_name',
 'first_name',
 'position',
 'height_inches',
 'weight_lbs',
 'min',
 'pts',
 'fg_pct',
 'reb',
 'ast',
 'blk',
 'stl']



In [16]:

    
plot_with_pos(player_stats, 'ast', 'stl')









    












    Out[16]:





[('Center', 0.50013837907370517),
 ('Guard', 0.65501940965378647),
 ('Forward', 0.6066604618930902)]

The correlation of assists and steals is one of the few that actually holds up when you drill in across positions.

Let's end with some C->Q comparisons by looking at single stats across positions with side-by-side box plots.



In [17]:

    
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}



In [19]:

    
def box_plot_by_pos(rows, col_name):
    by_pos = collections.defaultdict(list)
    for pos, val in rows.clean_rows('position', col_name, conv=False):
        by_pos[pos].append(float(val))
    plt.figure()
    plt.ylabel(col_name)
    pos_in_order = ['Guard', 'Forward', 'Center'] # plot looks better in this order (smallest to largest pos)
    plt.boxplot([by_pos[pos] for pos in pos_in_order], labels=pos_in_order)
    plt.show()
    
for el in ['height_inches', 'weight_lbs', 'min', 'pts', 'fg_pct', 'reb', 'ast', 'blk', 'stl']:
    box_plot_by_pos(player_stats, el)
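For reading the box plots, it can help to see the numbers behind one box. The sketch below computes them with the standard library only, assuming matplotlib's default whiskers (the most extreme points within 1.5 times the interquartile range of the box) and assuming the `'inclusive'` quartile method is the one that lines up with matplotlib's default. The function name `box_summary` and the toy values are hypothetical, not part of the notebook or the data set.

```python
import statistics

def box_summary(values):
    """Quartiles, whisker ends and outliers as a default box plot would draw them."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme data points still inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }

# Made-up toy values for illustration only, NOT taken from the NBA data set.
toy_values = [1, 2, 3, 4, 5, 6, 7, 8, 30]
toy_summary = box_summary(toy_values)
print(toy_summary)
```

Points listed under `outliers` are the ones a box plot draws as individual markers beyond the whiskers.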